Verification of outsourced data streams

ABSTRACT

Embodiments disclosed herein are directed to verifying query results of an untrusted server. A data owner outsources a data stream to the untrusted server, which is configured to respond to a query from a client with a query result. The data owner can maintain a vector associated with query results returned by the server and can generate a verification synopsis using the vector and a seed. The verification synopsis includes a polynomial, where coefficients of the polynomial are determined based on the seed. The data owner outputs the verification synopsis and the seed to a client for verification of the query results.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention is directed to query verification of untrusted servers, and more specifically to a query verifier that uses a synopsis to verify query results.

2. Brief Discussion of Related Art

Due to the overwhelming flow of information in many data stream applications, data outsourcing is a natural and effective paradigm for individual businesses to address the issue of scale. In conventional data outsourcing models, the data owner outsources streaming data to one or more third-party servers, which answer queries posed by a potentially large number of clients on the data owner's behalf. Data outsourcing intrinsically raises issues of trust. Conventional approaches to query verification build upon cryptographic primitives, such as signatures and collision-resistant hash functions, which typically only work for certain types of queries, such as simple selection/aggregation queries.

Conventional industrial and academic Data Stream Management Systems (DSMS) have been developed in recent years. The need for such DSMSs is mainly driven by the continuous nature of the data being generated by a variety of real-world applications, such as telephony and networking. Providing fast and reliable querying services on the streaming data to clients is central to many businesses. However, due to the overwhelming data flow observed in most data streams, companies typically do not possess the necessary resources for deploying a DSMS. In these cases, outsourcing the data stream and the desired computations to a third-party server can be the only alternative. Outsourcing also solves the issue of scale. That is, as the number of clients increases, the number of mirroring servers employed by the data owner can be increased. In addition, this can often lead to faster query responses, since these servers can be closer to the clients than a single centralized server. However, because data outsourcing and remote computations raise issues of trust, outsourced query verification on data streams is a problem with important practical implications.

For example, a data owner with limited resources, such as memory and bandwidth, may outsource its data stream to one or more remote, untrusted servers that can be compromised, malicious, running faulty software, etc. A client registers a continuous query on the DSMS of the server and receives results upon request. Assuming that the server charges the data owner according to the computation resources consumed and the volume of traffic processed for answering the queries, the server then has an incentive to deceive the owner and the client for increased profit. Furthermore, the server might have a competing interest to provide fraudulent answers to a particular client. Hence, a passive malicious server could drop query results or provide random answers in order to reduce the computation resources required for answering queries, while a compromised or active malicious server might be willing to spend additional computational resources to provide fraudulent results (by altering, dropping, or introducing spurious answers). In other cases, incorrect answers might simply be a result of faulty software, or due to load shedding strategies, which are essential tools for dealing with bursty streaming data.

Ideally, the data owner and the client should be able to verify the integrity of the computation performed by the server using significantly fewer resources than having the query answered directly, i.e., where the data owner evaluates the query locally and then transmits the entire query result to the client. If a client wants to verify the query results with absolute confidence, the only solution is for the data owner to evaluate the query exactly and transmit the entire result to the client, which defeats the purpose of outsourcing.

Further, the client should have the capability to tolerate errors caused by load shedding algorithms or other non-malicious operations, while at the same time being able to identify malicious attacks that have a significant impact on the result.

SUMMARY OF THE INVENTION

Embodiments of the present invention are directed to a method, a medium, and a computing system for verifying a query result of an untrusted server. A data stream is outsourced to the untrusted server, which is configured to respond to a query with the query result. A verification synopsis, which in some embodiments can use at most 3 words of memory, is generated using at least a portion of the query result and a seed. The verification synopsis includes a polynomial, where coefficients of the polynomial are determined based on the seed. The verification synopsis and the seed are output to a client for verification of the query result. Verification of the query result can be performed in a single pass of the query result. A result synopsis can be computed using the query result and the result synopsis is compared with the verification synopsis to verify the query result.

In some embodiments, the seed can be required by the client to use the verification synopsis and can have a value that is undisclosed by the data owner until the seed is output to the client. An alarm can be raised when the verification synopsis and the query results do not match and/or the number of errors between the verification synopsis and the query results exceeds a threshold. In some embodiments, errors in the query result can be located and corrected using the verification synopsis and/or the number of errors in the query result can be estimated using the verification synopsis.

In some embodiments, a vector can be maintained corresponding to the query result and the verification synopsis can be computed using the vector when a request for verification is received.

In some embodiments, layers can be generated, where each layer includes buckets. One of the buckets can be represented by the verification synopsis and elements of the vector are assigned to one bucket per layer.

Other objects and features of the present invention will become apparent from the following detailed description considered in conjunction with the accompanying drawings. It is to be understood, however, that the drawings are designed as an illustration only and not as a definition of the limits of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an exemplary data outsourcing architecture 100.

FIG. 2 is an exemplary illustration of a synopsis that includes layers and buckets.

FIG. 3 is an exemplary computing device for implementing embodiments of the present invention.

DETAILED DESCRIPTION

Exemplary embodiments of the present invention are illustrated using “GROUP BY, COUNT” and “GROUP BY, SUM” queries known to those skilled in the art, although it will be recognized that other queries can be used. In network monitoring applications, a computation of the total number of packets originating from and destined to certain IP addresses is often desired. The “GROUP BY, COUNT” query is substantially equivalent to computing the frequencies (e.g., occurrences) of tuples in a stream. Most streaming algorithms deal with either the frequencies directly or their relatives, such as frequency moments, heavy hitters, quantiles, and inverse frequencies, and focus on computing the answers to these queries, but not on their verification.

Embodiments of the present invention include a query verifier that uses algebraic and probabilistic techniques to compute a small synopsis on the true query result, which can be communicated to the client for verification of whether the query result returned by the server is correct. A synopsis is a small summary based on a query result that is maintained and updated by the data owner and includes a polynomial with coefficients that are based on query results and a random seed. As used herein, a “verification synopsis” refers to a synopsis generated by the data owner using a vector based on query results and a “result synopsis” refers to a synopsis generated by a client using query results. A polynomial is a mathematical expression having constants and variables forming terms, each term consisting of a constant multiplier and one or more variables raised to a power, where the terms are separated by mathematical operations. A seed is a random number defined by the data owner that is required to perform the verification of query results using the synopsis. Embodiments of the present invention advantageously provide high-confidence probabilistic solutions with arbitrarily minuscule probability of error, and develop verification algorithms that incur minimal resources, in terms of both the memory consumption of the data owner and the data owner-client network bandwidth.

FIG. 1 is an exemplary data outsourcing architecture 100, in which an untrusted server 110 maintains and makes available, to at least one client 120, outsourced data streams 102 associated with a data owner 130. Outsourcing refers to sending information to an independent third party such that the third party performs the operations typically performed by the data owner. The untrusted server, client, and data owner can be computing devices, the implementation of which is known to those skilled in the art. The untrusted server 110 is a server that operates independent of the data owner 130 as a third party to provide clients 120 of the data owner 130 with query results. A query result is a response from the untrusted server to a query that includes information requested by a client. As a result, neither the client 120 nor the data owner 130 can be certain that the query results are valid. The untrusted server 110 can implement a Data Stream Management System (DSMS) 112, which stores and manages data stream information provided by a data owner and retrieves the information in response to queries. A data stream is a continuous flow of discrete pieces of information from one computing device to another computing device. The data stream can have a beginning and an end, where all information sent between the beginning and the end is considered to be part of the data stream. The client 120 can register a query 114 with the untrusted server 110 and can receive query results 116 when requested. The data owner generally does not have control of the query results provided by the untrusted server 110. In some embodiments, the data owner can also act as the client. In these embodiments, the data owner typically does not have enough processing power and resources to run a DSMS itself, but has enough resources to compute the verification synopses instead.

To ensure that the untrusted server 110 is providing valid results, the data owner 130 includes a query verifier 132, which generates and maintains a randomized synopsis 134 of the query results. The query verifier 132 generates the randomized synopsis 134, which is referred to herein as a “polynomial identity random synopsis” (PIRS), to raise an alarm with very high confidence if errors in the query results 116 exist. Specifically, the data owner 130 preferably maintains the synopsis 134 at a constant size of three machine words (e.g., 3 bytes), and transmits the synopsis 134 to the client 120 preferably via a secure channel upon a verification request 122. The query result 116 can be verified using this small synopsis 134. The synopsis 134 can be maintained using constant space and low processing cost per tuple in the stream (O(1) for count queries and O(log n) or O(log μ) for sum queries, where n is the number of possible groups, μ is the update amount per tuple, and O denotes an asymptotic upper bound). In some embodiments, the synopsis 134 can be used for verifying multiple simultaneous queries with the same aggregate attribute but different group-by partitioning, where the size of the synopsis 134 is the same as that for verifying a single query.

In some embodiments, the query verifier 132 can implement a buffer 136 for data streams that exhibit a large degree of locality. The buffer 136 can be used to store exact aggregate results for a small number of groups. With data locality, a large portion of updates hit the buffer 136. Whenever the buffer 136 is full and a new group needs to be inserted, a victim is evicted from the buffer using a simple least recently used (LRU) policy. Only then does the evicted group update PIRS, using the overall aggregate value computed within the buffer 136. The buffer 136 is emptied to update PIRS whenever verification is required.
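The buffering scheme above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes the PIRS-1 update rule X ← X·(α−i)^u described later in this description, non-negative aggregates, and a hypothetical modulus 2⁶¹−1; the class name `BufferedPirs1` and its methods are illustrative only.

```python
from collections import OrderedDict

P = 2**61 - 1  # assumed prime modulus (hypothetical choice for this sketch)


class BufferedPirs1:
    """PIRS-1 synopsis fronted by a small LRU buffer of exact group aggregates."""

    def __init__(self, alpha, capacity):
        self.alpha = alpha
        self.capacity = capacity
        self.synopsis = 1            # X(0) = 1
        self.buffer = OrderedDict()  # group id -> aggregate not yet folded in

    def _apply(self, group, amount):
        # Fold one buffered aggregate into the synopsis: X *= (alpha - group)^amount.
        if amount:
            factor = pow((self.alpha - group) % P, amount, P)
            self.synopsis = self.synopsis * factor % P

    def update(self, group, amount=1):
        if group in self.buffer:
            # Buffer hit: only the exact aggregate changes, the synopsis is untouched.
            self.buffer[group] += amount
            self.buffer.move_to_end(group)  # mark as most recently used
            return
        if len(self.buffer) >= self.capacity:
            # Buffer full: evict the least recently used group and update PIRS once.
            victim, pending = self.buffer.popitem(last=False)
            self._apply(victim, pending)
        self.buffer[group] = amount

    def flush(self):
        # Empty the buffer before a verification request is answered.
        while self.buffer:
            victim, pending = self.buffer.popitem(last=False)
            self._apply(victim, pending)
        return self.synopsis
```

With strong locality most calls to `update` touch only the dictionary, so the modular exponentiation is paid once per eviction instead of once per tuple.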

Embodiments of the query verifier 132 can use generalizations of the synopsis 134, using the basic PIRS as a building block to create different verification schemes. In some embodiments, a generalization of PIRS is generated for raising alarms when the number of errors exceeds a predefined threshold γ. This synopsis, referred to herein as PIRS^(γ), allows the server 110 to return query results having a predefined margin of error, which can result from, for example, using semantic load shedding. When the number of errors is less than the predefined threshold γ, the server is still considered to be trustworthy.

In some embodiments, a weaker version of PIRS^(γ), referred to herein as PIRS^(±γ), can be generated for allowing the server 110 to return query results having an approximate predefined margin of error, where if the errors are fewer than γ⁻ the answer is considered valid, while if the errors are more than γ⁺ the answer is considered invalid.

In some embodiments, a strengthened version of PIRS^(γ), referred to herein as PIRS^(γ)*, can be generated so that when the number of errors is tolerable, the errors can be located and even corrected. When PIRS^(γ)* is implemented, it can also act as an error-correcting code, which can guarantee that the complete and correct query results can be delivered to the client 120 when the number of errors is less than the predefined threshold γ.

In some embodiments, a version of the synopsis 134, referred to herein as FM-PIRS, can be used to estimate the actual number of errors accurately. FM-PIRS is a parameter-free version of PIRS^(±γ) in the sense that it does not depend on the predefined threshold γ. In particular, when the number of errors exceeds the predefined threshold γ, PIRS^(±γ) raises an alarm, while FM-PIRS also reports an estimate of the actual number of errors. FM-PIRS has a smaller size than PIRS^(±γ) for a large enough threshold γ.

Exemplary queries used to illustrate the query verifier 132 can have the following structure:

    SELECT AGG(A_1), . . . , AGG(A_N) FROM T
    WHERE . . . GROUP BY G_1, . . . , G_M

GROUP BY aggregate queries can have wide applications in monitoring and statistical analysis of data streams (e.g., in networking and telephony applications). An example of such a query that appears frequently in network monitoring applications is the following:

    SELECT SUM(packet_size) FROM IP_Trace
    GROUP BY source_ip, destination_ip (*)

The above query (*) is used for illustrating exemplary embodiments, where sum and count are used, and is referred to herein as the “illustrative query”. Other aggregates that can be converted to sum and count, such as average, standard deviation, and the like, can be supported by verifying each component separately (i.e., verifying the sum and the count in the case of average).

The “GROUP BY” predicate partitions streaming tuples into a set of n groups, computing one sum per group. The data stream can be viewed as a sequence of additions (or subtractions) over a set of items in [n]={1, . . . , n}. This data stream is denoted herein as S and its τ-th tuple as s^(τ)=(i, u^(τ)), an update of amount u^(τ) to the i-th group. The query answer can be expressed as a dynamic vector of non-negative integers v^(τ)=[ν₁ ^(τ), . . . , ν_(n) ^(τ)]∈ℕ^(n) containing one component per group aggregate. Initially, v⁰ is the zero vector. A new tuple s^(τ)=(i, u^(τ)) updates the corresponding group i in v^(τ) as ν_(i) ^(τ)=ν_(i) ^(τ−1)+u^(τ). The amount u^(τ) can be either positive or negative, but ν_(i) ^(τ)≧0 is required for all τ and i. When count queries are concerned, u^(τ) is substantially equal to 1 for all τ. An assumption can be made that the L₁ norm of v^(τ) is always bounded by some large m, i.e., at any τ,

$\|v^{\tau}\|_{1} = \sum_{i=1}^{n} \nu_{i}^{\tau} \leq m.$

The following definition captures the semantics of Continuous Query Verification (CQV):

Formal Definition 1. Given a data stream S, a continuous query Q, and a user-defined parameter $\delta \in (0, \frac{1}{2})$, a synopsis X of vector v is designed such that for any τ, given any w^(τ) and using X(v^(τ)), an alarm is raised with probability at least 1−δ if w^(τ)≠v^(τ) and an alarm is not raised if w^(τ)=v^(τ).

In some embodiments, w^(τ) can be, for example, an answer provided by the server 110, while X(v^(τ)) can be the synopsis 134 communicated to the client 120 from the data owner 130 for verifying vector v^(τ).

Using the above definition, an alarm can be raised with high probability if any component (or group answer) ν_(i) ^(τ) is inconsistent. As an example, consider a server that is using semantic load shedding, i.e., dropping tuples from certain groups, on bursty stream updates. In this example, the aggregate of a certain, small number of components will be inconsistent without malicious intent. In some embodiments, a certain degree of tolerance in the number of erroneous answers contained in the query results is allowed, rather than raising alarms indiscriminately. The following definition captures the semantics of Continuous Query Verification with Tolerance for a Limited Number of Errors (CQV^(γ)):

Formal Definition 2. For any w, v∈ℕ^(n), let E(w, v)={i|w_(i)≠ν_(i)}. Then define w≠_(γ)v if and only if |E(w, v)|≧γ and w=_(γ)v if and only if |E(w, v)|<γ. Given a data stream S, a continuous query Q, and user-defined parameters $\gamma \in \{1, \ldots, n\}$ and $\delta \in (0, \frac{1}{2})$, a synopsis X of vector v is designed such that, for any τ, given any w^(τ) and using X(v^(τ)), an alarm is raised with probability at least 1−δ if w^(τ)≠_(γ)v^(τ); and an alarm is not raised if w^(τ)=_(γ)v^(τ).

Formal definition 1 is a special case of formal definition 2, where the threshold γ=1.
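The relations in formal definition 2 can be made concrete with a small sketch. It compares two fully materialized vectors directly, so it only illustrates the semantics of E(w, v), =_(γ), and ≠_(γ); it is not how the synopsis performs the check. The function names are hypothetical.

```python
def error_set(w, v):
    """Return E(w, v): the 1-based indices at which w and v disagree."""
    if len(w) != len(v):
        raise ValueError("w and v must have the same number of groups")
    return {i for i, (w_i, v_i) in enumerate(zip(w, v), start=1) if w_i != v_i}


def equal_up_to(w, v, gamma):
    """True when w =_gamma v, i.e. fewer than gamma components differ."""
    return len(error_set(w, v)) < gamma


# With gamma = 1 the relation reduces to exact equality (formal definition 1).
v = [1, 0, 3, 5]
w = [1, 2, 3, 4]
print(error_set(w, v))       # groups 2 and 4 differ
print(equal_up_to(w, v, 3))  # two errors are tolerated when gamma = 3
```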

Some embodiments can support random load shedding (i.e., can tolerate small absolute or relative errors on any component irrespective of the total number of inconsistent components). The following definition captures the semantics of Continuous Query Verification with Tolerance for Small Errors (CQV^(η)).

Formal Definition 3. For any w, v∈ℕ^(n), let w≉_(η)v if and only if there is some i such that |w_(i)−ν_(i)|>η, and w≈_(η)v if and only if |w_(i)−ν_(i)|≦η for all i∈[n]. Given a data stream S, a continuous query Q, and user-defined parameters η and $\delta \in (0, \frac{1}{2})$, a synopsis X of vector v can be designed such that, for any τ, given any w^(τ) and using X(v^(τ)), an alarm is raised with probability at least 1−δ if w^(τ)≉_(η)v^(τ); and an alarm is not raised if w^(τ)≈_(η)v^(τ).

Formal definition 3 requires the absolute errors for each ν_(i) ^(τ) to be no larger than η. It is also possible to use relative errors (i.e., raise an alarm if and only if there is some i such that |w_(i) ^(τ)−ν_(i) ^(τ)|/|ν_(i) ^(τ)|>η). Thus, formal definition 1 is also a special case of formal definition 3 with η=0.

Those skilled in the art will recognize that further variations and cases can also be defined. For example, one may wish to bound the sum of the absolute errors, or bound both the number and the size of the errors. Under a standard RAM model, used for illustrative purposes, it is assumed that addition, subtraction, multiplication, division, or modular arithmetic operations involving two words take one unit of time. It is also assumed that n/δ and m/δ fit in a word.

Embodiments of the Polynomial Identity Random Synopsis (PIRS) can be denoted by X(v). The synopsis 134 is based on testing the identity of polynomials by evaluating the polynomials at a randomly chosen point. The technique of verifying polynomial identities is well-known to those skilled in the art. PIRS can have two variants, named PIRS-1 and PIRS-2, respectively.

For PIRS-1, let p be some prime such that max{m/δ, n}<p. For the space analysis, let p≦2 max{m/δ, n}. For PIRS-1, the field ℤ_(p) is used, where additions, subtractions, and multiplications are done modulo p. A seed α is chosen from ℤ_(p) uniformly at random and X(v) is computed incrementally from X(v^(τ−1)) and s^(τ)=(i, u^(τ)) as:

$X(v^{\tau}) = X(v^{\tau-1}) \cdot (\alpha - i)^{u^{\tau}}, \qquad X(\mathbf{0}) = X(v^{0}) = 1, \qquad (1)$

so that

$X(v) = (\alpha - 1)^{\nu_{1}} \cdot (\alpha - 2)^{\nu_{2}} \cdots (\alpha - n)^{\nu_{n}}. \qquad (2)$

The values of n, m, δ, and p are known to the data owner 130 and the clients 120. The data owner 130 secretly picks a seed α and maintains X(v). Upon a verification request 122, the data owner 130 returns the synopsis 134 (e.g., PIRS) to the client 120, which can include only two words: seed α and X(v). Given any answer w (i.e., query results 116) returned by the server 110, the client 120 can use PIRS to check if w=v with high probability, by computing

$X(w) = (\alpha - 1)^{w_{1}} \cdot (\alpha - 2)^{w_{2}} \cdots (\alpha - n)^{w_{n}}. \qquad (3)$

If $\sum_{i=1}^{n} w_{i} > m$, the answer w is rejected immediately as being erroneous. If X(w)=X(v), then it is declared that w=v. Otherwise, an alarm is raised. Using this approach, a false alarm is never raised, and it can be shown that a true alarm will be missed with probability at most δ; that is, given any w≠v, PIRS raises an alarm with probability at least 1−δ. To illustrate this, consider the polynomials ƒ_(v)(x)=(x−1)^(ν₁)(x−2)^(ν₂) . . . (x−n)^(ν_(n)) and ƒ_(w)(x)=(x−1)^(w₁)(x−2)^(w₂) . . . (x−n)^(w_(n)). Since a polynomial with 1 as its leading coefficient (i.e., the coefficient of the term with the largest degree) is completely determined by its zeroes (with multiplicities), ƒ_(v)(x)≡ƒ_(w)(x) if and only if v=w. Since both ƒ_(v)(x) and ƒ_(w)(x) have degree at most m, ƒ_(v)(x)=ƒ_(w)(x) happens at no more than m values of x if v≠w, because the nonzero polynomial ƒ_(v)(x)−ƒ_(w)(x) has degree at most m and hence at most m roots in ℤ_(p). Since there are p≧m/δ choices for seed α, the probability that X(v)=X(w) happens is at most δ over the random choice of seed α.
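The PIRS-1 protocol described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the claimed implementation: the class `Pirs1`, the function `verify_pirs1`, and the modulus 2⁶¹−1 are hypothetical choices, and only non-negative updates are handled here.

```python
import secrets


class Pirs1:
    """Data-owner side: maintains X(v) = prod_i (alpha - i)^(v_i) mod p."""

    def __init__(self, p, alpha=None):
        self.p = p
        # The seed is drawn uniformly from Z_p and kept secret until verification.
        self.alpha = secrets.randbelow(p) if alpha is None else alpha
        self.x = 1  # X(0) = 1

    def update(self, i, u=1):
        # Equation (1): X <- X * (alpha - i)^u, for a non-negative amount u.
        self.x = self.x * pow((self.alpha - i) % self.p, u, self.p) % self.p


def verify_pirs1(answer, alpha, x_v, p, m):
    """Client side: `answer` maps group id -> w_i. Returns True when no alarm is raised."""
    if sum(answer.values()) > m:
        return False  # rejected immediately: the L1 bound is violated
    x_w = 1
    for i, w_i in answer.items():  # a single pass over the nonzero entries
        x_w = x_w * pow((alpha - i) % p, w_i, p) % p
    return x_w == x_v  # equation (3) compared against the owner's X(v)


p = 2**61 - 1                       # assumed prime larger than max{m/delta, n}
owner = Pirs1(p, alpha=987654321)   # fixed seed only to make the example repeatable
for group, amount in [(3, 2), (7, 1), (3, 4)]:
    owner.update(group, amount)
print(verify_pirs1({3: 6, 7: 1}, owner.alpha, owner.x, p, m=100))  # correct answer
print(verify_pirs1({3: 5, 7: 1}, owner.alpha, owner.x, p, m=100))  # one tuple dropped
```

In practice the seed would be left to `secrets.randbelow`; a fixed seed is used above only so that the two outcomes are reproducible.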

The update time to maintain X(v) as new updates are observed can be determined as follows. For count queries, each tuple increments one of the ν_(i)'s by one, so the update cost is constant (one subtraction and one multiplication). For sum queries, a tuple s=(i, u) increases ν_(i) by u, so (α−i)^(u) is computed, which can be done in O(log u) time (exponentiation by repeated squaring). To perform a verification with the answer w, (α−i)^(w_(i)) is computed for each nonzero entry w_(i) of the answer w, which takes O(log w_(i)) time, so the time needed for a verification is

$O\left(\sum \log w_{i}\right) = O\left(|w| \log \frac{m}{|w|}\right),$

where |w| denotes the number of nonzero entries of w. Since both X(v) and seed α are smaller than p, the space complexity of the synopsis is

$O\left(\log \frac{m}{\delta} + \log n\right)$ bits.
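The equality in the verification-time bound can be checked with one application of Jensen's inequality to the concave logarithm, together with the earlier condition that an answer whose entries sum to more than m is rejected outright. As a sketch, with the sums taken over the |w| nonzero entries of w:

$\sum_{i:\, w_{i} \neq 0} \log w_{i} \;\leq\; |w| \log\left(\frac{1}{|w|} \sum_{i:\, w_{i} \neq 0} w_{i}\right) \;\leq\; |w| \log \frac{m}{|w|}.$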

PIRS-1 occupies $O\left(\log \frac{m}{\delta} + \log n\right)$ bits of space, spends O(1) (resp. O(log u)) time to process a tuple for count (resp. sum) queries, and $O\left(|w| \log \frac{m}{|w|}\right)$ time to perform a verification.

When u is negative (or when handling deletions for count queries), the field ℤ_(p) may not be equipped with division. In this case, (α−i)⁻¹, which is the multiplicative inverse of (α−i) in ℤ_(p), is first computed in O(log p) time, and then (α−i)^(−1·|u|) is computed.
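A negative update can be sketched as below. The helper name `pirs1_update` is hypothetical, and the sketch relies on Python's built-in three-argument `pow` with exponent −1 (available from Python 3.8) to obtain the multiplicative inverse instead of a hand-written extended Euclidean algorithm.

```python
def pirs1_update(x, alpha, i, u, p):
    """Return the PIRS-1 synopsis after applying the update (i, u); u may be negative."""
    base = (alpha - i) % p
    if u < 0:
        # Multiplicative inverse of (alpha - i) in Z_p.
        # pow raises ValueError if base is 0 modulo p (no inverse exists).
        base = pow(base, -1, p)
        u = -u
    return x * pow(base, u, p) % p


p = 2**61 - 1   # assumed prime modulus for this sketch
alpha = 123456  # fixed seed for repeatability only
x = pirs1_update(1, alpha, 5, 3, p)    # add 3 to group 5
x = pirs1_update(x, alpha, 5, -3, p)   # remove the same amount
print(x == 1)                          # the synopsis returns to X(0)
```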

For PIRS-2, space usage can be improved when n<<m. The prime p is chosen between max{m, n/δ} and 2 max{m, n/δ}. For a seed α chosen uniformly at random from ℤ_(p), the following can be computed:

$X(v) = \nu_{1}\alpha + \nu_{2}\alpha^{2} + \cdots + \nu_{n}\alpha^{n}.$

By adding uα^(i) in response to update s=(i, u), the above synopsis is straightforward to maintain over a stream of updates. PIRS-2 has an O(log n) update cost for both count and sum queries, since uα^(i) is computed for a tuple (i, u) in the stream. PIRS-2 occupies $O\left(\log m + \log \frac{n}{\delta}\right)$ bits of space, spends O(log n) time to process a tuple, and O(|w| log n) time to perform a verification.
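PIRS-2 can be sketched in the same style. The class `Pirs2`, the function `verify_pirs2`, and the concrete modulus and seed are hypothetical choices for illustration; a real seed would be drawn uniformly at random from ℤ_(p).

```python
class Pirs2:
    """Data-owner side: maintains X(v) = sum_i v_i * alpha^i mod p."""

    def __init__(self, p, alpha):
        self.p = p
        self.alpha = alpha
        self.x = 0  # the zero vector maps to 0

    def update(self, i, u=1):
        # Add u * alpha^i; pow(alpha, i, p) costs O(log n) multiplications.
        # Python's % always returns a non-negative result, so negative u is fine.
        self.x = (self.x + u * pow(self.alpha, i, self.p)) % self.p


def verify_pirs2(answer, alpha, x_v, p):
    """Client side: recompute the synopsis of the answer in one pass and compare."""
    x_w = 0
    for i, w_i in answer.items():
        x_w = (x_w + w_i * pow(alpha, i, p)) % p
    return x_w == x_v


p = 2**61 - 1
owner = Pirs2(p, alpha=2)  # tiny seed so the arithmetic can be followed by hand
for group, amount in [(1, 3), (4, 2), (1, -1)]:
    owner.update(group, amount)
# v = {1: 2, 4: 2}, so X(v) = 2*2 + 2*2**4 = 36
print(owner.x)
print(verify_pirs2({1: 2, 4: 2}, owner.alpha, owner.x, p))
```

Because the synopsis is additive, deletions need no modular inverse here, in contrast to PIRS-1.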

One property of either variant of PIRS (e.g., PIRS-1 or PIRS-2) is that the verification can be performed in one pass of the answer w using a constant number of words of memory. This is advantageous when |w| is large. The client 120 can receive the answer w in a streaming fashion, verify it online, and either forward it to a dedicated server 140 (FIG. 1) for further processing, or a network storage device 150 (FIG. 1) for offline use.

Embodiments of the synopses solving the CQV problem with error probability of at most δ keep $\Omega\left(\log \frac{\min\{m, n\}}{\delta}\right)$ bits. To illustrate this, assume that the vector v and the answer w are both taken from a universe U, and let M be the set of all possible memory states the synopsis can have. Any synopsis X can be seen as a function ƒ:U→M; and if X is randomized, it can be seen as a function randomly chosen from a family of such functions F={ƒ₁, ƒ₂, . . . , ƒ_(l)}, where ƒ_(i) is chosen with probability p(ƒ_(i)). Without loss of generality, assume that p(ƒ₁)≧p(ƒ₂)≧ . . . ≧p(ƒ_(l)). The synopsis X needs at least log |M| bits to record the output of the function and log |F| bits to describe the function chosen randomly from F.

For any w≠v∈U, let F_(w,v)={ƒ∈F|ƒ(w)=ƒ(v)}. For a randomized synopsis X to solve CQV with error probability at most δ, the following must hold for all w≠v∈U:

$\sum_{f \in \mathcal{F}_{w,v}} p(f) \leq \delta. \qquad (5)$

Focusing on the first k=⌈δ|F|⌉+1 functions ƒ₁, . . . , ƒ_(k), it can be seen that

$\sum_{i=1}^{k} p(f_{i}) > \delta.$

Since there are a total of |M|^(k) possible combinations for the outputs of these k functions, it must hold that

$|\mathcal{U}| \leq |\mathcal{M}|^{k} \qquad (6)$

so that no two w≠v∈U have ƒ_(i)(w)=ƒ_(i)(v) for all i=1, . . . , k; otherwise an answer w and a vector v can be found that violate equation (5). Taking the log of both sides of equation (6), the following results:

$\log |\mathcal{U}| \leq (\lceil \delta \cdot |\mathcal{F}| \rceil + 1) \log |\mathcal{M}|. \qquad (7)$

Since vector v has n entries whose sum is at most m, by simple combinatorics,

$|\mathcal{U}| \geq \binom{m + n}{n},$

or log|U|≧min{m, n}. The following tradeoff can therefore be obtained:

$|\mathcal{F}| \cdot \log |\mathcal{M}| = \Omega(\min\{m, n\}/\delta). \qquad (8)$

If log|F|≦(1−ε)log(min{m, n}/δ) (i.e., |F|≦(min{m, n}/δ)^(1−ε)) for any constant ε∈(0, 1), then X uses super-polylogarithmic space log|M|=Ω((min{m, n}/δ)^(ε)); else X keeps log|F|≧log(min{m, n}/δ) random bits. Therefore, when m≦n, PIRS-1 is optimal as long as $\log n = O\left(\log \frac{m}{\delta}\right)$; and when m>n, PIRS-2 is optimal as long as $\log m = O\left(\log \frac{n}{\delta}\right)$. The bounds are not tight when $\log \frac{m}{\delta} = o(\log n)$ or $\log \frac{n}{\delta} = o(\log m)$.

The analysis above focuses on bit-level space complexity. The value of p is chosen to be the maximum prime that fits in a word, so as to minimize δ, where δ=m/p for PIRS-1 and δ=n/p for PIRS-2. For example, if 64-bit words are used and m<2³², then δ is at most 2⁻³² for PIRS-1, which makes any error highly unlikely (e.g., 1 in four billion). For speed considerations, a careful choice of p can allow a faster implementation. For example, choosing p to be a Mersenne prime (i.e., p is of the form p=2^(l)−1 for some l) allows the modulo arithmetic to be performed using simple addition and subtraction operations.
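The Mersenne-prime reduction mentioned above can be sketched as follows. The choice l=61 (2⁶¹−1 is a known Mersenne prime) and the function names are assumptions made for illustration; Python integers are unbounded, so the sketch only mirrors what fixed-width code would do with shifts, masks, and additions.

```python
L_BITS = 61
P = (1 << L_BITS) - 1  # 2^61 - 1, a Mersenne prime


def mod_mersenne(x):
    """Reduce 0 <= x < 2^122 modulo P without a division.

    Because 2^61 is congruent to 1 modulo P, the high bits can simply be
    added onto the low bits.
    """
    x = (x & P) + (x >> L_BITS)  # first fold: result is below 2^62
    x = (x & P) + (x >> L_BITS)  # second fold: result is at most 2^61
    return x - P if x >= P else x  # one conditional subtraction finishes the job


def mul_mod(a, b):
    """Multiply two residues in [0, P) and reduce the product."""
    return mod_mersenne(a * b)


print(mul_mod(P - 1, P - 1) == ((P - 1) * (P - 1)) % P)
```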

Since the group id i is extracted from each incoming tuple directly, without the use of a dictionary (which would increase the memory cost), the size of the group space, n, needs to be large for certain queries. For example, the illustrative query discussed above has a group space of n=2⁶⁴ (the combination of two IP addresses), although the actual number of nonzero entries |v| may be much less than n. Since m is typically much smaller than n in this example, PIRS-1 is the better choice.

Embodiments of the present invention can be used for handling multiple queries simultaneously. For example, consider a number of aggregate queries on a single attribute (e.g., packet size), but with different partitioning on the input tuples (e.g., source/destination IP and source/destination port). Let Q₁, . . . , Q_(k) be k such queries, and let the i-th query partition the incoming tuples into n_(i) groups for a total of $n = \sum_{i=1}^{k} n_{i}$ groups. In some embodiments, PIRS can be applied once per query, using space linear in k. In other embodiments, the queries can be treated as one unified query of n groups so that one PIRS is used to verify the combined vector v. The time cost for processing one update increases linearly in k, since each incoming tuple is updating k components of vector v at once (one group for every query in the worst case):

Using PIRS-1 for k queries can occupy $O\left(\log \frac{m}{\delta} + \log n\right)$ bits of space, spend O(k) (resp. O(k log u)) time to process a tuple for count (resp. sum) queries, and $O\left(|w| \log \frac{m}{|w|}\right)$ time to perform a verification. As a result, multiple queries can be effectively verified with a few words of memory and communication.

After an error has been detected, the client 120 can choose to disclose this information to the server 110. If the error is not reported, then the probability of detecting an error remains 1−δ. However, errors can occur due to faulty software or bad communication links, and may not be intentional. In this case, it can be beneficial to give a warning to the server 110. Since an adversary can extract knowledge from this warning (e.g., it knows at least that the same response on the same data will always fail), the guarantee of detecting an error with the probability of 1−δ does not strictly hold. In order to restore the 1−δ success rate after a reported attack, the synopsis 134 has to be recomputed from scratch, which is not typically possible in a streaming setting. Hence, it can be important to rigorously quantify the loss of guarantee after a series of warnings have been sent out without resetting the synopsis.

To achieve this, let e_(k)=1 if the k-th attack goes undetected and e_(k)=0 otherwise. Let p_(k) be the probability that the server succeeds in its k-th attack after k−1 failed attempts (i.e., p_(k)=Pr[e_(k)=1|e₁=0, . . . , e_(k−1)=0]). Therefore, it can be determined that p₁≦δ. To demonstrate the strength of one embodiment of the present invention, p_(k) is bounded by an upper limit with respect to the most powerful server A. It can be assumed that the server A knows how PIRS works except its random seed, α; maximally explores the knowledge that could be gained from one failed attack; and possesses unbounded computational power.

The best server A could do to improve p_(k) over multiple attacks isquantified. The space of seeds used by PIRS is denoted R. For any answerw and vector v, the set of witnesses is denoted as W(w, v)={r∈

|PIRS raises an alarm on r} and the set of non-witnesses is denoted asW(w, v)=R−W(w, v). Note that | W(w, v)|≦δ|

if w≠v, and W(w, v)=

if w=v. Suppose the seed PIRS uses is r. Server A returns a correctanswer w=v, server A cannot infer anything about r. If server A returnssome w≠v and gets a warning, it is possible that server A can determiner∉ W(w, v). However, even if it is assumed that server A has enoughcomputational power to compute both the sets of witnesses andnon-witnesses, it can be impossible for server A to infer which witnessPIRS is using as r. After k−1 failed attacks using w₁, . . . , w_(k−),the set of seeds that server A has ruled out is ∪_(i=1) ^(k−1) W(w_(i),v_(i)), whose cardinality is at most (k−1)δ|R|. Thus,

$p_{k} \leq \frac{\delta}{1 - (k - 1)\delta},$ where

$\begin{aligned} p_{k} &= \frac{|\text{set of non-witnesses}|}{|\text{set of remaining seeds}|} \\ &= \frac{|\overline{W}(w_{k}, v_{k})|}{|\mathcal{R} - \bigcup_{i=1}^{k-1}\overline{W}(w_{i}, v_{i})|} \leq \frac{\delta}{1 - (k - 1)\delta}. \end{aligned} \qquad (9)$

Assuming that server A has made a total of k attacks to PIRS for any k, the probability that none of them succeeds is at least 1−kδ.

This probability is

$\begin{aligned} \Pr[e_{1} = 0 \wedge \ldots \wedge e_{k} = 0] &= \prod_{i=1}^{k} ( 1 - \Pr[e_{i} = 1 \mid e_{1} = 0, \ldots, e_{i-1} = 0] ) \\ &\geq \prod_{i=1}^{k} ( 1 - \frac{\delta}{1 - (i - 1)\delta} ) \\ &= \prod_{i=1}^{k} \frac{1 - i\delta}{1 - (i - 1)\delta} \\ &= \frac{1 - \delta}{1} \cdot \frac{1 - 2\delta}{1 - \delta} \cdots \frac{1 - k\delta}{1 - (k - 1)\delta} \\ &= 1 - k\delta. \end{aligned} \qquad (10)$

The above shows that PIRS is very resistant towards coordinated multiple attacks, even against an adversary with unlimited computational power. For a typical value of δ=2⁻³², PIRS could tolerate millions of attacks before the probability of success becomes noticeably less than 1. The drop in the detection rate to 1−kδ occurs only if the client chooses to disclose the attacks to the server. Such disclosure is not required in many applications.
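The telescoping product in equation (10) can be checked numerically. The following is a minimal sketch using exact rational arithmetic; the function name `survival_lower_bound` is hypothetical and is introduced only for illustration.

```python
from fractions import Fraction


def survival_lower_bound(delta, k):
    """Product over i = 1..k of (1 - delta / (1 - (i - 1) * delta)), computed exactly."""
    prob = Fraction(1)
    for i in range(1, k + 1):
        # Upper bound on p_i taken from equation (9).
        p_i = delta / (1 - (i - 1) * delta)
        prob *= 1 - p_i
    return prob


delta = Fraction(1, 2**32)
for k in (1, 10, 1000):
    # The product collapses to 1 - k * delta, as in equation (10).
    assert survival_lower_bound(delta, k) == 1 - k * delta
```

Because the arithmetic is exact, the equality holds for any k with (k−1)δ<1, not only for the sampled values.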

The PIRS can be extended to support sliding windows. PIRS-1 for count queries is used as an illustrative example. Those skilled in the art will recognize that the extension for sliding windows to sum queries, as well as to PIRS-2, PIRS^(γ), and PIRS^(±γ), can also be similarly implemented.

One property of PIRS-1 is that it is decomposable (i.e., for any v₁, v₂, X(v₁+v₂)=X(v₁) X(v₂), and for PIRS-2, X(v₁+v₂)=X(v₁)+X(v₂)). This property allows PIRS to be extended for periodically sliding windows using standard techniques. One example of a sliding window query might be the following.

    SELECT SUM(packet_size) FROM IP_Trace
    GROUP BY source_ip, destination_ip
    WITHIN LAST 1 hour SLIDE EVERY 5 minutes

In this example, PIRS-1 can be built for every 5-minute period, and can be kept in memory until it expires from the sliding window. Assume that there are k such periods in the window, and let X(v₁), . . . , X(v_(k)) be the PIRS for these periods. In addition, the data owner maintains the overall PIRS

$X(v) = \prod\limits_{i = 1}^{k} X(v_{i}).$ When a new PIRS X(v_(k+1)) completes, X(v) is updated as X(v):=X(v) X(v_(k+1)) (X(v₁))⁻¹. For a periodically sliding window query with k periods, the synopsis uses

$O( k( \log\frac{m}{\delta} + \log n ) )$ bits of space, spends O(1) time to process an update, and

$O( |w| \log\frac{m}{|w|} )$ time to perform a verification.

For various window sizes consisting of between 1 and k periods, the k periods can be decomposed into a number of dyadic intervals. For simplicity, assume that k is a power of 2. These intervals can be organized into l=log k+1 levels. On level 0, there are k intervals each consisting of one period; on level i, 1≦i≦l−1, there are k/2^(i) intervals, each spanning 2^(i) periods. There are a total of 2k−1 such dyadic intervals for this example. One PIRS is built for each interval, so the total size of the entire synopsis is still

$O( k( \log\frac{m}{\delta} + \log n ) ).$ Since a PIRS at level i+1 can be computed in constant time from two PIRS's at level i, the amortized update cost remains O(1). Upon a verification request with a window size of q periods, the window can be decomposed into at most O(log k) dyadic intervals, and those corresponding PIRS's can be combined together to form the correct synopsis for the query window. To support sliding window queries with various window sizes of up to k periods, the synopsis uses

$O( k( \log\frac{m}{\delta} + \log n ) )$ bits of space, spends O(1) time to process an update, and O(log k) time to assemble the required synopsis upon a verification request. The client spends

$O( |w| \log\frac{m}{|w|} )$ time to perform a verification.

In some embodiments, a synopsis that tolerates a few errors, thereby solving the CQV^(γ) problem, can be implemented. To achieve this, let the threshold γ be the number of components in vector v that are allowed to be inconsistent. In some embodiments, a construction is presented that gives an exact solution that satisfies the requirements of CQV^(γ), and requires

$O( \gamma^{2} \log\frac{1}{\delta} \log n )$ bits of space. This synopsis can be strengthened so that errors can be located and even corrected. This exact solution uses space quadratic in γ. In other embodiments, an approximate solution which uses only

$O( \gamma \log\frac{1}{\delta} ( \log m + \log n ) )$ bits can be implemented. In other embodiments, a synopsis that can estimate the number of errors can be implemented that uses polylogarithmic space and does not depend on the threshold γ.

Using PIRS as a building block, the synopsis 134 can be constructed so that it satisfies the requirements of CQV^(γ). Referring to FIG. 2, this embodiment of the synopsis 134, referred to herein as “PIRS^(γ)”, consists of multiple layers 200, where each layer 200 contains k=c₁γ² buckets 210 (c₁≧1 is a constant to be determined shortly). Each component of vector v is assigned to one bucket 210 per layer 200, and each bucket 210 is represented using only its PIRS synopsis. PIRS^(γ) raises an alarm if at least γ buckets in any layer raise an alarm. If there are fewer than γ errors, no layer will raise an alarm, and if there are more than γ errors, at least one of the layers raises an alarm with high probability (when the γ inconsistent components do not collide on any bucket for this layer). By choosing the probability of failure of the individual PIRS synopsis carefully, it can be guaranteed that PIRS^(γ) achieves the requirements of formal definition 2.

Examining one layer of PIRS^(γ), let b be a pairwise independent hash function which maps the range {1, . . . , n} uniformly onto {1, . . . , k}. PIRS^(γ) assigns v_(i) to the b(i)-th bucket, and for each bucket computes the PIRS synopsis of the assigned subset of v_(i)'s with probability of failure δ′=1/(c₂γ), where c₂≧1 is a constant. Using PIRS-1 as an example, each of these k synopses occupies

$O( \log\frac{m}{\delta^{\prime}} + \log n ) = O( \log m + \log n )$ bits. Given some w=_(γ)v, since there are fewer than γ errors, no alarm is raised. Constants c₁ and c₂ can be chosen such that if w≠_(γ)v, then an alarm is raised with probability at least ½ for this layer. There are two cases in which the query verifier fails to raise an alarm: first, where there are fewer than γ buckets that contain erroneous components of w; and second, where there are at least γ buckets containing erroneous components but at least one of them fails to raise an alarm due to the failure probability of PIRS. Setting constants c₁=c₂=4.819, each of the above cases occurs with probability at most ¼. In the first case, the v_(i)'s are assigned to the buckets in a pairwise independent fashion, and the mapping of the γ erroneous components onto the k buckets fails to be injective with probability at most

$\begin{aligned} 1 - ( 1 - \frac{1}{k} )^{\frac{\gamma(\gamma - 1)}{2}} &\leq 1 - ( 1 - \frac{1}{c_{1}\gamma^{2}} )^{\gamma^{2}/2} \\ &\leq 1 - 2^{-1/c_{1}} \\ &\leq \frac{1}{4}, \end{aligned} \qquad (11)$ where the last inequality holds by the choice of c₁. In the second case, the probability that some of the γ buckets that are supposed to raise an alarm fail is:

$1 - ( 1 - \delta^{\prime} )^{\gamma} = 1 - ( 1 - \frac{1}{c_{2}\gamma} )^{c_{2}\gamma/c_{2}} \leq 1 - 2^{-\frac{2}{c_{2}}} < \frac{1}{4}, \qquad (12)$ which holds as long as c₂≧4.819.
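The constant 4.819 can be checked against the closing bounds of equations (11) and (12). The sketch below evaluates the two right-hand sides and solves 2^(−2/c₂)=¾ for the threshold value of c₂; the function names are hypothetical.

```python
import math


def collision_bound(c1):
    # Final bound of equation (11): 1 - 2^(-1/c1).
    return 1 - 2 ** (-1 / c1)


def bucket_failure_bound(c2):
    # Final bound of equation (12): 1 - 2^(-2/c2).
    return 1 - 2 ** (-2 / c2)


# Value of c2 at which the bound in (12) equals exactly 1/4: 2^(-2/c2) = 3/4.
c2_threshold = 2 / math.log2(4 / 3)

assert 4.818 < c2_threshold < 4.819
assert bucket_failure_bound(4.819) < 0.25
assert collision_bound(4.819) < 0.25
```

Under these bounds the two failure cases together occur with probability at most ½ in one layer, which is the per-layer guarantee the surrounding text relies on.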

Therefore, using one layer, PIRS^(γ) raises an alarm with probability at least ½ on some w≠_(γ)v, and will not raise an alarm if w=_(γ)v. By using log(1/δ) layers and reporting an alarm if at least one of these layers raises an alarm, the probability is boosted to 1−δ. So, for any w≠_(γ)v, PIRS^(γ) raises an alarm with probability at least 1−δ, and for any w=_(γ)v, PIRS^(γ) does not raise an alarm.

In addition to the k log(1/δ) PIRS synopses, a hash function b mapping updates to buckets is generated. This is achieved by picking x and y uniformly at random from $\mathbb{Z}_{p}$, and computing b(i)=xi+y mod p mod k, where “mod” denotes the modulo operation. This generates a function that is pairwise-independent over the random choices of x and y. Verification can be performed by computing the synopses of all the layers in parallel while making one pass over the answer w. Initialization, update, and verification for PIRS^(γ) appear in pseudocode below.

Algorithm 1: PIRS^(γ)-INITIALIZE(Prime p, Threshold γ)
    c = 4.819, k = ⌈cγ²⌉
    Generate x and y uniformly at random from Z_p
    for l = 1, ..., ⌈log 1/δ⌉ do
        Layer L_(l) = [X₁(v) := 0, ..., X_(k)(v) := 0]
        // X_(j)(v) is a PIRS synopsis with δ′ = 1/(cγ)

Algorithm 2: PIRS^(γ)-UPDATE(Tuple s = (i, u))
    for l = 1, ..., ⌈log 1/δ⌉ do
        b_(l)(i) = ((xi + y) mod p) mod k + 1
        Update L_(l).X_(b_(l)(i))(v) using s

Algorithm 3: PIRS^(γ)-VERIFY(Vector w)
    for l = 1, ..., ⌈log 1/δ⌉ do
        Layer M_(l) = [X₁(w) := 0, ..., X_(k)(w) := 0]
        // X_(j)(w) is a PIRS synopsis with δ′ = 1/(cγ)
        for i = 1, ..., n do
            b_(l)(i) = ((xi + y) mod p) mod k + 1
            Update M_(l).X_(b_(l)(i))(w) by s = (i, w_(i))
        if |{j | L_(l).X_(j)(v) ≠ M_(l).X_(j)(w), 1 ≦ j ≦ k}| ≧ γ then
            Raise an alarm
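The three algorithms above can be combined into a small runnable sketch. Several choices here are assumptions rather than details taken from the text: the class name `PirsGamma` is hypothetical, each bucket is a PIRS-1 synopsis over the integers modulo a prime p, a single seed α is shared by all buckets, an independent (x, y) pair is drawn per layer, buckets are indexed from 0, and the synopsis of an empty bucket is the empty product 1.

```python
import math
import random


class PirsGamma:
    """Layered PIRS-1 buckets with alarm threshold gamma (illustrative only)."""

    def __init__(self, p, gamma, delta, rng):
        self.p, self.gamma = p, gamma
        self.k = math.ceil(4.819 * gamma * gamma)  # buckets per layer
        self.num_layers = math.ceil(math.log2(1 / delta))
        self.alpha = rng.randrange(p)              # secret PIRS seed
        # Assumption: one pairwise-independent hash (x, y) per layer.
        self.hashes = [(rng.randrange(1, p), rng.randrange(p))
                       for _ in range(self.num_layers)]
        self.layers = self._empty()

    def _empty(self):
        return [[1] * self.k for _ in range(self.num_layers)]

    def _apply(self, table, i, u):
        factor = pow(self.alpha - i, u, self.p)
        for l, (x, y) in enumerate(self.hashes):
            b = ((x * i + y) % self.p) % self.k
            table[l][b] = table[l][b] * factor % self.p

    def update(self, i, u):
        self._apply(self.layers, i, u)

    def raises_alarm(self, w):
        # Recompute every layer for the answer w in one pass, then compare buckets.
        mine = self._empty()
        for i, w_i in w.items():
            self._apply(mine, i, w_i)
        return any(sum(a != b for a, b in zip(lv, lw)) >= self.gamma
                   for lv, lw in zip(self.layers, mine))
```

With γ=2, an answer that is wrong in a single component changes exactly one bucket per layer and therefore never triggers the alarm, while an answer with several wrong components is flagged unless those components collide in every layer.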

PIRS^(γ) requires

$O( \gamma^{2} \log\frac{1}{\delta} ( \log m + \log n ) )$ bits, spends

$O( \log\frac{1}{\delta} )$ time to process a tuple in the stream, and

$O( |w| ( \gamma + \log\frac{m}{|w|} ) \log\frac{1}{\delta} )$ time to perform a verification. Careful analysis can facilitate a smaller constant in the asymptotic cost above. For a given γ, the minimum k is chosen such that equation (11) is at most ½, and 1/δ′ is chosen to be very large (close to the maximum allowed integer) so that equation (12) is almost zero. For instance, if γ=2 and 3, then

$2 \log\frac{1}{\delta}$ and $6 \log\frac{1}{\delta}$ words suffice, respectively. For arbitrary γ, the storage requirement is

$2\gamma^{2} \log\frac{1}{\delta}$ words in the worst case.

When there are a small number of errors (at most γ), PIRS^(γ) does not raise an alarm, which gives some leeway to the server 110. This is often necessary so that the server can cope with large volumes of incoming data using some semantic load shedding strategies. However, in some critical applications, if the client 120 demands complete correctness, PIRS^(γ) may not be sufficient, since it may only indicate to the client 120 if there are fewer than γ errors, but not where they are. In some embodiments, a strengthened version of PIRS^(γ), referred to herein as PIRS^(γ)*, can be implemented that is able to identify which groups are affected by errors, and even compute the correct sums for the affected groups by taking advantage of a technique based on the binary decomposition of the group identifier.

Applying the binary decomposition to PIRS^(γ), the amount of information kept about each bucket is increased. In addition to keeping a PIRS synopsis of all items which fall into a given bucket, 2⌈log n⌉ PIRS synopses are maintained and arranged as a two-dimensional array A of size ⌈log n⌉×2. When an update to group i is placed into bucket b(i), the PIRS in A[j, bit(i, j)] is updated, for all 1≦j≦⌈log n⌉, where bit(i, j) denotes the jth bit in the binary representation of i.

To perform a query verification, the array A of PIRS synopses is computed for both the vector v and the answer w for each bucket. If all corresponding entries match, then (with high probability) there are no erroneous components in the bucket. If, for any j, the PIRS in both A[j, 0] and A[j, 1] do not match, then this indicates that there is more than one erroneous component in this bucket, because a single erroneous i cannot contaminate both A[j, 0] and A[j, 1]. Otherwise, there must be exactly one erroneous component falling into this bucket. This is the case for all erroneous components with high probability, provided that there are at most γ such components. In this case, for each j, exactly one of A[j, 0] and A[j, 1] does not match. If it is A[j, 1], this indicates that the jth bit of the identifier i of the erroneous group is 1; otherwise, it is 0. Using ⌈log n⌉ pairs of PIRS, the identifier can therefore be recovered exactly.
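The bit-by-bit recovery of an erroneous group identifier within one bucket can be sketched as follows. The helper names `bit_array` and `locate_error` are hypothetical, bits are numbered from 0 (least significant) for convenience, and each entry of A is taken to be a PIRS-1 synopsis over the integers modulo a prime p.

```python
def bit_array(vector, n_bits, alpha, p):
    """A[j][b] is a PIRS-1 synopsis of the components whose j-th bit equals b."""
    A = [[1, 1] for _ in range(n_bits)]
    for i, count in vector.items():
        factor = pow(alpha - i, count, p)
        for j in range(n_bits):
            b = (i >> j) & 1
            A[j][b] = A[j][b] * factor % p
    return A


def locate_error(A_v, A_w):
    """Return None (no mismatch), the id of the single wrong group, or 'multiple'."""
    ident, mismatched = 0, False
    for j, (row_v, row_w) in enumerate(zip(A_v, A_w)):
        bad0, bad1 = row_v[0] != row_w[0], row_v[1] != row_w[1]
        if bad0 and bad1:
            # One wrong component cannot contaminate both A[j, 0] and A[j, 1].
            return "multiple"
        if bad1:
            ident |= 1 << j  # the j-th bit of the wrong identifier is 1
        mismatched |= bad0 or bad1
    return ident if mismatched else None


p, alpha = 2**31 - 1, 987654321
v = {5: 3, 9: 2, 12: 7}
w = dict(v)
w[9] = 4  # the server reports a wrong value for group 9
assert locate_error(bit_array(v, 4, alpha, p), bit_array(w, 4, alpha, p)) == 9
```

The sketch covers a single bucket; in PIRS^(γ)* the same comparison would be repeated for every bucket of every layer.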

The erroneous components w_(i) in the answer w returned by the server 110 can thus be located. Moreover, enough information exists to recover the true v_(i) for each wrong result. For example, suppose the bucket at layer l contains exactly one error, which is v_(i). Note that the data owner will return X_(b_(l)(i))(v)=Π_(j)(α−j)^(v_(j)). Since only v_(i) is unknown in this equation, v_(i) can be found by solving the equation. If PIRS-2 is used, the equation becomes X_(b_(l)(i))(v)=Σ_(j)v_(j)α^(j). Thus, v_(i) can be found efficiently using field subtractions and divisions.
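For PIRS-2, solving for the single unknown v_(i) amounts to one subtraction and one division in the field. A minimal sketch follows, assuming the field is the integers modulo a prime p, that α is non-zero, and that the true value is smaller than p; the function names are hypothetical.

```python
def pirs2(vector, alpha, p):
    """PIRS-2 synopsis: sum over j of v_j * alpha^j, modulo p."""
    return sum(value * pow(alpha, j, p) for j, value in vector.items()) % p


def recover_value(x_v, w, i, alpha, p):
    """Recover the true v_i when i is the only wrong component of w in the bucket."""
    # X(v) - X(w) equals (v_i - w_i) * alpha^i because all other terms cancel.
    diff = (x_v - pirs2(w, alpha, p)) % p
    return (w[i] + diff * pow(alpha, -i, p)) % p


p, alpha = 2**31 - 1, 48271
v = {3: 10, 7: 4, 20: 99}
w = dict(v)
w[7] = 1  # wrong value reported for group 7
assert recover_value(pirs2(v, alpha, p), w, 7, alpha, p) == 4
```

The client only needs the owner's synopsis value and the located identifier i; the remaining components of the bucket are taken from the answer w itself.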

In PIRS^(γ)*, each PIRS in PIRS^(γ) is replaced with an array of O(log n) PIRS, so the space and time increase by an O(log n) factor. PIRS^(γ)* requires

$O( \gamma^{2} \log\frac{1}{\delta} \log n ( \log m + \log n ) )$ bits, spends

$O( \log\frac{1}{\delta} \log n )$ time to process a tuple in the stream, and

$O( |w| ( \gamma + \log\frac{m}{|w|} ) \log\frac{1}{\delta} \log n )$ time to perform a verification. For any w≠_(γ)v, PIRS^(γ)* raises an alarm with probability 1−δ; for any w=_(γ)v, PIRS^(γ)* does not raise an alarm but correctly identifies and recovers the errors in the answer w with probability 1−δ.

When the number of errors, for example λ, is no more than γ, PIRS^(γ)* can recover all the errors with high probability. When λ>γ, there may be too many errors to expect a complete recovery of all the query results. Nevertheless, PIRS^(γ)* can recover a good portion of the results. For this analysis, precision and recall are used to measure the performance of the synopsis. Precision refers to the probability that an identified error is truly an actual error. Since PIRS does not have false positives, precision is always 1. Recall is the percentage of the actual errors that have been recovered, or equivalently, the probability that any one error has been captured by the synopsis. For any given error ε, if the error ε falls into a bucket by itself in any of the layers, then PIRS^(γ)* can correctly recover it. For a particular layer, because the errors are distributed into the buckets pairwise-independently and there are c₁γ² buckets, the probability that the bucket containing ε is the same as the bucket for any of the other λ−1 errors is at most λ/(c₁γ²) following the union bound. Since the log(1/δ) layers are mutually independent, the probability that this collision happens in all layers is

$( \frac{\lambda}{c_{1}\gamma^{2}} )^{\log\frac{1}{\delta}} = \delta^{\Omega( \log( \gamma^{2}/\lambda ) )}.$

When there are λ>γ errors, PIRS^(γ)* raises an alarm with probability 1−δ and recovers the errors with a recall of 1−δ^(Ω(log(γ²/λ))).

The exact solution is advantageous when only a small number of errors can be tolerated. In applications where γ is large, the quadratic space requirement can be prohibitive. If alarms can be raised when approximately γ errors have been observed, a space-efficient synopsis can be implemented. This approximation is often acceptable since when γ is large, users may not be concerned if the number of errors detected deviates from γ by a small amount. An approximate solution, denoted PIRS^(±γ), guarantees that PIRS^(±γ) raises no alarm with probability at least 1−δ on any w=_(γ⁻)v, where

$\gamma^{-} = ( 1 - \frac{c}{\ln \gamma} )\gamma,$ and raises an alarm with probability at least 1−δ on any w≠_(γ⁺)v, where

$\gamma^{+} = ( 1 + \frac{c}{\ln \gamma} )\gamma,$ for any constant c>−ln ln 2≈0.367. The multiplicative approximation ratio

$1 \pm \frac{c}{\ln \gamma}$ is close to 1 for large γ.

PIRS^(±γ) also contains multiple layers of buckets, where each bucket is assigned a subset of the components of vector v and summarized using PIRS (FIG. 2). Focusing on one layer only, one goal is, for any w=_(γ⁻)v, to not raise an alarm with probability at least ½+ε for some constant ε∈(0, ½), and on any w≠_(γ⁺)v to raise an alarm with probability at least ½+ε. By using

$O( \log\frac{1}{\delta} )$ independent layers and reporting the majority of the results, the probabilistic guarantee will be boosted to 1−δ using Chernoff bounds described in “Randomized Algorithms”, by Motwani et al., the subject matter of which is incorporated by reference in its entirety.

As an example, let k be the number of buckets per layer. The components of vector v are distributed into the k buckets in a γ⁺-wise independent fashion, and for each bucket the PIRS summary of those components is computed using δ′=1/γ². Given some answer w, let this layer raise an alarm only if all the k buckets report alarms. If the answer w contains more than γ⁺ erroneous members, then the probability that every bucket gets at least one such component is high; and if the answer w contains fewer than γ⁻ erroneous members, then the probability that there exists some bucket that is not assigned any erroneous members is also high.

One factor that determines whether a layer could possibly raise an alarm is the distribution of erroneous components into buckets. The event that all buckets raise alarms is only possible if each bucket contains at least one inconsistent component. Consider all the inconsistent components in the answer w in some order, for example w₁, w₂, . . . , where each of them can be considered a collector that randomly picks a bucket to “collect”. Assume that there are enough inconsistent elements, and let the random variable Y denote the number of inconsistent components required to collect all the buckets (i.e., Y is the smallest i such that w₁, . . . , w_(i) have collected all the buckets). The problem becomes an instantiation of the coupon collector's problem (viewing buckets as coupons and erroneous components as trials). With k buckets, it is known that E(Y)=k ln k+O(k); therefore k is set such that γ=⌈k ln k⌉. It can be seen that k=O(γ/ln γ), hence the desired storage requirement.
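The coupon collector quantities used above can be made concrete with a short sketch. The exact expectation is E(Y)=k·H_(k), where H_(k) is the k-th harmonic number, which agrees with k ln k+O(k). The helper `buckets_for_threshold` is a hypothetical way of inverting γ=⌈k ln k⌉ by taking the largest k whose ⌈k ln k⌉ does not exceed γ.

```python
import math
from fractions import Fraction


def expected_trials(k):
    """Exact E(Y) = k * H_k for collecting k equally likely buckets."""
    return k * sum(Fraction(1, j) for j in range(1, k + 1))


def buckets_for_threshold(gamma):
    """Largest k with ceil(k ln k) <= gamma (illustrative inversion only)."""
    k = 1
    while math.ceil((k + 1) * math.log(k + 1)) <= gamma:
        k += 1
    return k


assert expected_trials(3) == Fraction(11, 2)
assert buckets_for_threshold(100) == 29
```

For γ=100 the sketch selects 29 buckets per layer, compared with the ⌈4.819·100²⌉ buckets per layer that the exact PIRS^(γ) construction would use.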

For any constant c′,

$\Pr[ Y \leq k( \ln k - c^{\prime} ) ] \leq e^{-e^{c^{\prime}}} + o(1), \qquad (13)$

$\Pr[ Y \geq k( \ln k + c^{\prime} ) ] \leq 1 - e^{-e^{-c^{\prime}}} + o(1), \qquad (14)$

where o(1) depends on k.

Notice that ln γ≦2 ln k for any k≧2, so the above equations also imply that for any real constant c:

$\Pr[ Y \leq \gamma - c\frac{\gamma}{\ln \gamma} = \gamma^{-} ] \leq e^{-e^{c}} + o(1), \qquad (15)$

$\Pr[ Y \geq \gamma + c\frac{\gamma}{\ln \gamma} = \gamma^{+} ] \leq 1 - e^{-e^{-c}} + o(1). \qquad (16)$

If w=_(γ⁻)v, then the probability that these fewer than γ⁻ independent erroneous components cover all buckets is bounded by equation (15), which is also the upper bound for the probability that the layer raises an alarm. Thus, for any c≧0, the probability of raising a false alarm is (for large enough γ) at most

$e^{-e^{c}} \leq 1/e. \qquad (17)$

If w≠_(γ⁺)v, then considering γ⁺ of the inconsistent components, which are independently distributed to the buckets, there are two cases in which a true alarm is not raised: first, when the γ⁺ components do not cover all buckets; and second, when all the buckets are covered but at least one of them fails to report an alarm. The probability that the first case occurs is bounded by equation (16), while the probability that the second case occurs is at most 1−(1−δ′)^(k). By the union bound, the total probability that a false negative is produced is at most

$1 - e^{-e^{-c}} + o(1) + 1 - ( 1 - \delta^{\prime} )^{k} \leq 2 - e^{-e^{-c}} - 2^{-\frac{2}{\gamma}} + o(1). \qquad (18)$ For γ large enough, there exists a constant ε>0 such that this probability is at most ½−ε for any c>−ln ln 2.

To summarize, if c>−ln ln 2≈0.367, then both the false positive and false negative probabilities are at most ½−ε for some constant ε at one layer with k=O(γ/log γ) buckets.

To drive down the error probabilities for both false positives and false negatives to δ, l=O(log(1/δ)) layers are used and the simple majority of their “votes” is reported. This probability is quantified for false negatives; the other case is symmetric.

Each layer can be viewed as a coin flip that raises a true alarm with probability at least ½+ε. Let the random variable Z denote the number of layers that raise alarms. This process is a sequence of independent Bernoulli trials, hence Z follows the binomial distribution. For l independent layers, the expectation of Z is at least μ=(½+ε)l. By the Chernoff bound, the probability that fewer than half of the layers raise alarms is

$\Pr[ Z < \frac{1}{2}\ell ] = \Pr[ Z < ( 1 - \frac{2\varepsilon}{1 + 2\varepsilon} )\mu ] < e^{-\frac{\mu}{2}( \frac{2\varepsilon}{1 + 2\varepsilon} )^{2}}. \qquad (19)$

It therefore suffices to ensure that

$e^{-\frac{\mu}{2}( \frac{2\varepsilon}{1 + 2\varepsilon} )^{2}} \leq \delta,$ which can be satisfied by taking

$\ell = \lceil \frac{1 + 2\varepsilon}{\varepsilon^{2}} \ln\frac{1}{\delta} \rceil.$
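Substituting μ=(½+ε)l into the Chernoff bound of equation (19) and requiring the result to be at most δ gives l≧((1+2ε)/ε²) ln(1/δ). The following sketch computes that number of layers and re-evaluates the bound; the function names are hypothetical.

```python
import math


def layers_needed(eps, delta):
    """Smallest integer l with l >= (1 + 2*eps) / eps^2 * ln(1/delta)."""
    return math.ceil((1 + 2 * eps) / eps**2 * math.log(1 / delta))


def majority_failure_bound(layers, eps):
    """Right-hand side of equation (19) with mu = (1/2 + eps) * layers."""
    mu = (0.5 + eps) * layers
    return math.exp(-(mu / 2) * (2 * eps / (1 + 2 * eps)) ** 2)


num_layers = layers_needed(0.1, 1e-6)
assert majority_failure_bound(num_layers, 0.1) <= 1e-6
```

The layer count grows as 1/ε², so a per-layer advantage ε close to zero makes the majority vote expensive even for moderate δ.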

A γ⁺-wise independent random hash function is generated to map groups to buckets. Using standard techniques, such a function can be generated using O(γ log n) truly random bits. Specifically, the technique for constructing t-universal hash families can be used, with t=γ⁺. Let p be some prime between n and 2n, and let α₀, . . . , α_(t−1) be t random numbers chosen uniformly and independently from $\mathbb{Z}_{p}$. Then b(i) is set as

$b(i) = ( ( \sum\limits_{j = 0}^{t - 1} \alpha_{j} i^{j} ) \bmod p ) \bmod k + 1. \qquad (20)$

This function is guaranteed to be drawn from a t-wise independent family of functions (so that, over the random choice of the function, the probability of t items colliding under the hash function is 1/k^(t−1)). For an incoming tuple s=(i, u), b(i) is computed using the α_(j)'s in O(γ) time (using Horner's rule), and the update to the corresponding PIRS is then performed. This requires the storage of O(γ⁺)=O(γ) truly random numbers per layer. As a result, PIRS^(±γ) uses

$O( \gamma \log\frac{1}{\delta} ( \log m + \log n ) )$ bits of space, spends

$O( \gamma \log\frac{1}{\delta} )$ time to process an update and

$O( |w| ( \gamma + \log\frac{m}{|w|} ) \log\frac{1}{\delta} )$ time to perform a verification. By allowing two-sided errors, the size of the synopsis can be reduced from quadratic in γ to linear.

An improved solution, FM-PIRS, for the CQV^(γ) problem can be generated, whose size and update cost only depend on the degree of approximation, but not γ, thus allowing it to scale well with γ. FM-PIRS directly estimates the number of errors in the result provided by the server, and then compares the estimate with γ. As a result, FM-PIRS can also support a wider range of values of γ, which can be given only at verification time. For small values of γ, the bounds and guarantees of PIRS^(γ) and PIRS^(±γ) are preferred, but for larger values of γ, the cost of FM-PIRS is preferable.

As the name suggests, FM-PIRS is a combination of PIRS and an FM sketch, which is described in “Probabilistic Counting Algorithms for Data Base Applications”, by Flajolet et al., the subject matter of which is incorporated herein by reference in its entirety. The FM sketch is used to estimate the number of distinct elements in a stream. The FM sketch is described as follows. Suppose that the universe is [n]={1, . . . , n}. A random hash function h:[n]→[2^(L)−1] is picked such that any h(i) is uniformly distributed over [2^(L)−1], where L=O(log n). For each element i in the stream, h(i) is computed. The number of trailing zeros in the binary representation of h(i) is denoted by r(i). The FM sketch computes R=max{r(i), for all i in the stream} and then outputs 2^(R)/φ, where φ≈0.7735. To improve accuracy, the universe is divided into k partitions using another random uniform hash function, and an R_(j) is computed for each partition. Finally, the output is k/φ·2^((R₁+ . . . +R_(k))/k).

To illustrate the FM-PIRS synopsis, the basic FM sketch with k=1 is used, although those skilled in the art will recognize that generalization to larger k is possible. Each “wrong” group i (i.e., one with v_(i)≠w_(i)) can be treated as a distinct element in the universe [n], and then R=max{r(i), for all wrong groups i} can be computed. Generally, the data owner does not know whether i is a wrong group, so R cannot be computed directly. Instead, a number L of PIRS's X₁, . . . , X_(L) with δ′=δ/L are created. For any i, group i is put into X_(j) if j≦r(i). Thus X₁ gets half of the groups, X₂ gets a quarter of the groups, etc.

The value of R can be computed as follows. When all of X₁, . . . , X_(L) correctly capture the errors in them, which happens with probability at least 1−δ′L=1−δ, R=max{j : X_(j) raises an alarm}.
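The construction with k=1 can be sketched end to end. Everything named here is an assumption made for illustration: a keyed BLAKE2b digest stands in for the random hash function h, each X_(j) is a PIRS-1 synopsis over the integers modulo a prime p with a shared seed α, and the estimate is reported as 0 when no X_(j) raises an alarm.

```python
import hashlib


def trailing_zeros(x, cap):
    """Number of trailing zero bits of x, capped at cap (x == 0 counts as cap)."""
    return cap if x == 0 else min(cap, (x & -x).bit_length() - 1)


def level_of(i, seed, L):
    """r(i): trailing zeros of a keyed hash of group id i (stand-in for h)."""
    digest = hashlib.blake2b(i.to_bytes(8, "big"), key=seed, digest_size=8).digest()
    return trailing_zeros(int.from_bytes(digest, "big"), L)


def fm_pirs(vector, seed, L, alpha, p):
    """X[j] for j = 1..L summarizes the groups with r(i) >= j; X[0] is unused."""
    X = [1] * (L + 1)
    for i, count in vector.items():
        factor = pow(alpha - i, count, p)
        for j in range(1, level_of(i, seed, L) + 1):
            X[j] = X[j] * factor % p
    return X


def estimate_errors(X_v, X_w, phi=0.7735):
    """2^R / phi with R = max{j : X_j raises an alarm}; 0.0 if no alarm is raised."""
    alarms = [j for j in range(1, len(X_v)) if X_v[j] != X_w[j]]
    return 0.0 if not alarms else 2 ** max(alarms) / phi
```

A client holding the owner's X₁, . . . , X_(L) would call `fm_pirs` on the answer w with the same seed and α, then pass both lists to `estimate_errors` and compare the result with γ.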

For any fixed number k of partitions, FM-PIRS has a size of O(k log n(log m+log n)) bits, processes a tuple in expected time O(1), and computes an estimate of the number of errors in the result in expected time O(|w| log(m/|w|)). With probability at least 1−δ, the estimate has a bias bounded by 1+0.31/k and a standard error of 0.78/√k.

Since each partition keeps L=O(log n) PIRS's, the overall size of FM-PIRS is O(k log n(log m+log n)) bits. For an incoming tuple, only one partition gets affected, but 0 to L of the PIRS's in this partition can get updated. Since the hash function h is uniform, the expected number of PIRS's updated is O(1). Upon receiving the FM-PIRS synopses of vector v and an answer w from the server, O(log w_(i)) expected time is spent per non-zero entry of the answer w to compute the FM-PIRS synopses of the answer w. So the expected time needed for an estimation is

$O( \sum\limits_{i} \log w_{i} ) = O( |w| \log\frac{m}{|w|} ).$

An analytical comparison of PIRS^(±γ) and FM-PIRS can be provided. Since FM-PIRS computes an estimate of the number of errors in the answer w, FM-PIRS can be used to do the same task for which PIRS^(±γ) is designed. For a fair comparison, k is set such that FM-PIRS provides the same probabilistic guarantee that PIRS^(±γ) does. The standard error of FM-PIRS is O(1/√k), and PIRS^(±γ) allows a deviation of O(1/ln γ).

By setting k=O(log² γ), it can therefore be guaranteed that FM-PIRS captures both false positives and false negatives with good probabilities (e.g., greater than ¾). Finally, by using

$O( \log\frac{1}{\delta} )$ independent copies of FM-PIRS and taking the median, the success probability can be boosted to 1−δ, the same as what PIRS^(±γ) guarantees. In addition, only L=O(log γ) PIRS's are needed, since estimating the number of errors when there are more than, for example, 2γ of them is not required.

Under this configuration, FM-PIRS uses

$O( \log^{3}\gamma ( \log m + \log n ) \log\frac{1}{\delta} )$ bits of space. Thus, asymptotically (as γ grows) FM-PIRS is better than PIRS^(±γ). However, for small γ, PIRS^(±γ) should be better in terms of size, while FM-PIRS becomes better when γ exceeds some large threshold.

FIG. 3 depicts an exemplary computing device 300 for implementing embodiments of the present invention. The computing device 300 can be a mainframe, personal computer (PC), laptop computer, workstation, handheld device, such as a PDA, or the like. The computing device 300 can be implemented as the server 110, the client 120, and/or the data owner 130. In the illustrated embodiment, the computing device 300 includes a central processing unit (CPU) 302 and preferably a display device 304. The display device 304 enables the computing device 300 to communicate directly with an operator through a visual display. The computing device 300 can further include data entry device(s) 306, such as a keyboard, touch screen, and/or mouse. The computing device 300 can include storage 308 for storing data and instructions. The storage 308 can include such technologies as a floppy drive, hard drive, tape drive, Flash drive, optical drive, read only memory (ROM), random access memory (RAM), and the like.

Applications 310 can be resident in the storage 308. The applications 310 can include instructions for implementing embodiments of the present invention. For embodiments where the computing device 300 is implemented as the data owner 130, the applications 310 can include instructions for implementing the query verifier 132. For embodiments where the computing device 300 is implemented as the untrusted server 110, the applications 310 can include instructions for implementing the DSMS 112. For embodiments where the computing device is implemented as the client 120, the applications 310 can include instructions for implementing the queries, as well as for implementing the verification of query results using the synopsis 134 generated by the data owner. The storage 308 can be local or remote to the computing device 300. The computing device 300 includes a network interface 312 for communicating with a network. The CPU 302 operates to run the application in storage 308 by performing instructions therein and storing data resulting from the performed instructions, which may be presented to an operator via the display 304 or by other mechanisms known to those skilled in the art, such as a printout from a printer.

Although preferred embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments and that various other changes and modifications may be effected herein by one skilled in the art without departing from the scope or spirit of the invention, and that it is intended to claim all such changes and modifications that fall within the scope of the invention.

1. A computer-implemented method of verifying a query result of anuntrusted server comprising: outsourcing, by a computing device, a datastream to the untrusted server, the untrusted server configured torespond to a query with the query result; generating, by the computingdevice, a verification synopsis based on at least a portion of the queryresult and a seed, the verification synopsis including a polynomial,coefficients of the polynomial being determined based on the seed;outputting, by the computing device, the verification synopsis and theseed to a client for verification of the query result; and locating andcorrecting errors in the query result, by the computing device, usingthe verification synopsis.
 2. The method of claim 1, wherein a value ofthe seed is undisclosed by the data owner until the seed is output tothe client, the seed being required by the client to use theverification synopsis.
 3. The method of claim 1 further comprisingraising an alarm when the verification synopsis and the query results donot match.
 4. The method of claim 1 further comprising raising an alarmwhen the number of errors between the verification synopsis and thequery results exceeds a threshold.
5. The method of claim 1 further comprising estimating the number of errors in the query result using the verification synopsis.
6. The method of claim 1 further comprising performing verification of the query result in a single pass of the query result.
7. The method of claim 1 further comprising: maintaining a vector corresponding to the query result; and computing the verification synopsis using the vector when a request for verification is received.
8. The method of claim 7 further comprising generating a plurality of layers, each having a plurality of buckets, one of the buckets being represented by the verification synopsis and elements of the vector being assigned to one bucket per layer.
9. The method of claim 7 further comprising: computing a result synopsis using the query result; and comparing the result synopsis with the verification synopsis to verify the query result.
10. The method of claim 1, wherein the verification synopsis uses at most 3 words of memory.
11. A non-transitory computer-readable medium comprising instructions, wherein execution of the instructions by at least one computing device facilitates verifying a query result of an untrusted server by: outsourcing a data stream to the untrusted server, the untrusted server configured to respond to a query with the query result; generating a verification synopsis using at least a portion of the query result and a seed, the verification synopsis including a polynomial, coefficients of the polynomial being determined based on the seed; outputting the verification synopsis and the seed to a client for verification of the query result; and locating and correcting errors in the query result using the verification synopsis.
12. The non-transitory computer-readable medium of claim 11, wherein a value of the seed is undisclosed by the data owner until the seed is output to the client, the seed being required by the client to use the verification synopsis.
13. The non-transitory computer-readable medium of claim 11, wherein execution of the instructions by at least one computing device further facilitates verifying a query result of an untrusted server by: maintaining a vector corresponding to the query result; and computing the verification synopsis using the vector when a request for verification is received.
14. The non-transitory computer-readable medium of claim 13, wherein execution of the instructions by at least one computing device further facilitates verifying a query result of an untrusted server by generating a plurality of layers, each having a plurality of buckets, one of the buckets being represented by the verification synopsis and elements of the vector being assigned to one bucket per layer.
15. The non-transitory computer-readable medium of claim 13, wherein execution of the instructions by at least one computing device further facilitates verifying a query result of an untrusted server by: computing a result synopsis using the query result; and comparing the result synopsis with the verification synopsis to verify the query result.
16. A system for verifying a query result of an untrusted server comprising: a computing device associated with a data owner configured to generate a verification synopsis using a vector associated with query results from the untrusted server, the verification synopsis including a polynomial and a seed, the computing device associated with the data owner being configured to output the verification synopsis and the seed to a computing device associated with a client being configured to verify an answer to a query using the verification synopsis, the computing device associated with the client being configured to locate and correct errors in the query result using the verification synopsis.
17. The system of claim 16, wherein the seed is undisclosed by the data owner until the seed is output to the client, the seed being required by the client to use the verification synopsis.
18. The system of claim 16, wherein the computing device associated with the data owner is configured to compute the verification synopsis when a request for verification is received from the client.
19. The system of claim 16, wherein the computing device associated with the data owner is configured to generate a layered verification synopsis, the layered verification synopsis including a plurality of layers, each having a plurality of buckets, one of the buckets being represented by the verification synopsis and elements of the vector being assigned to one bucket per layer.